NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

A Modular Static Cost Analysis for GPU Warp-Level Parallelism

https://doi.org/10.1145/3776693

Blike, Gregory; Zicarelli, Hannah; Sathiyamoorthy, Udaya; Lange, Julien; Cogumbreiro, Tiago (January 2026, Proceedings of the ACM on Programming Languages)

Graphics Processing Units (GPUs) are the accelerator of choice for performance-critical applications, yet optimizing for performance requires mastery of the complex interactions between its memory architecture and its execution model. Existing static analysis tools for GPU kernels either identify performance bugs without quantifying costs or cannot handle thread-divergent control flow, leading to significant over-approximations. We present the first static relational-cost analysis for GPU warp-level parallelism that can give exact bounds even in the presence of thread divergence. Our analysis is general and flexible, as it is parametric on the resource metric (uncoalesced accesses, bank conflicts) and on the cost relation (=, ≤, ≥). We establish a soundness theorem for our technique, provide mechanized proofs in Rocq and implement our theory in a tool called Pico. In a reproducibility experiment, Pico produced the tightest bounds in every input, outperforming the state-of-the-art tool RaCUDA in 10 kernels (1.7×better), while RaCUDA produced 4 incorrect bounds and crashed on 2 kernels. In an experiment to measure the accuracy of Pico, we studied the impact of thread-divergence in control-flow in a dataset of 226 kernels. We found that at least 75.3% of conditionals and 85.4% of loops can be captured exactly, without introducing approximation.
more » « less
Full Text Available
Hidden Assumptions in Static Verification of Data-race Free GPU Programs

https://doi.org/10.1007/978-3-031-97492-2_6

Cogumbreiro, Tiago; Lange, Julien (January 2025, Springer Nature Switzerland)

GPUs are massively parallel devices that promise a great return of investment at a cost: GPUs are notably difficult to get right. We discuss a static analysis tool for GPU programs, called Faial, that can detect data-races and data-race freedom. We studied a dataset of 191 data-race free programs and found that 98% needs specific thread configuration to be analyzable, and that 27% needs user-provided assertions to be analyzable. We also report that Faial was able to find data-races in at least 92% of the kernels with missing assumptions.
more » « less
Full Text Available
A Modular Static Cost Analysis for GPU Warp-Level Parallelism - Artifact POPL2026

https://doi.org/10.6084/m9.figshare.30689102

Blike, Gregory; Cogumbreiro, Tiago; LANGE, JULIEN; Zicarelli, Hannah; Sathiyamoorthy, Udaya (January 2025, figshare)

Graphics Processing Units (GPUs) are the accelerator of choice for performance-critical applications, yet optimizing for performance requires mastery of the complex interactions between its memory architecture and its execution model. Existing static analysis tools for GPU kernels either identify performance bugs without quantifying costs or cannot handle thread-divergent control flow, leading to significant over-approximations. We present the first static relational-cost analysis for GPU warp-level parallelism that can give exact bounds even in the presence of thread divergence. Our analysis is general and flexible, as it is parametric on the resource metric (uncoalesced accesses, bank conflicts) and on the cost relation (=, ≤, ≥). We establish a soundness theorem for our technique, provide mechanized proofs in Rocq and implement our theory in a tool called Pico. In a reproducibility experiment, Pico produced the tightest bounds in every input, outperforming the state-of-the-art tool RaCUDA in 10 kernels (1.7× better), while RaCUDA produced 4 incorrect bounds and crashed on 2 kernels. In an experiment to measure the accuracy of Pico, we studied the impact of thread-divergence in control-flow in a dataset of 226 kernels. We found that at least 75.3% of conditionals and 85.4% of loops can be captured exactly, without introducing approximation.
more » « less
Sound and Partially-Complete Static Analysis of Data-Races in GPU Programs

https://doi.org/10.1145/3689797

Liew, Dennis; Cogumbreiro, Tiago; Lange, Julien (October 2024, Proceedings of the ACM on Programming Languages)

GPUs are progressively being integrated into modern society, playing a pivotal role in Artificial Intelligence and High-Performance Computing. Programmers need a deep understanding of the GPU programming model to avoid subtle data-races in their codes. Static verification that is sound and incomplete can guarantee data-race freedom, but the alarms it raises may be spurious and need to be validated. In this paper, we establish a True Positive Theorem for a static data-race detector for GPU programs, i.e., a result that identifies a class of programs for which our technique only raises true alarms. Our work builds on the formalism of memory access protocols, that models the concurrency operations of CUDA programs. The crux of our approach is an approximation analysis that can correctly identify true alarms, and pinpoint the conditions that make an alarm imprecise. Our approximation analysis detects when the reported locations are reachable (control independence, or CI), and when the reported locations are precise (data independence, or DI), as well identify inexact values in an alarm. In addition to a True Positive result for programs that are CI and DI, we establish the root causes of spurious alarms depending on whether CI or DI are present. We apply our theory to introduce FaialAA, the first sound and partially complete data-race detector. We evaluate FaialAA in three experiments. First, in a comparative study with the state-of-the-art tools, we show that FaialAA confirms more DRF programs than others while emitting 1.9× fewer potential alarms. Importantly, the approximation analysis of FaialAA detects 10 undocumented data-races. Second, in an experiment studying 6 commits of data-race fixes in open source projects OpenMM and Nvidia’s MegaTron, FaialAA confirmed the buggy and fixed versions of 5 commits, while others were only able to confirm 2. Third, we show that 59.5% of 2,770 programs are CI and DI, quantifying when the approximation analysis of FaialAA is complete. This paper is accompanied by the mechanized proofs of the theoretical results presented therein and a tool (FaialAA) implementing of our theory.
more » « less
Full Text Available
Sound and partially-complete static analysis of data-races in GPU programs (Artifact)

https://doi.org/10.5281/zenodo.12666681

Liew, Dennis; Cogumbreiro, Tiago; Lange, Julien (January 2024, Zenodo)

{}
more » « less
Formalizing Model Inference of MicroPython

https://doi.org/10.1109/DSN-W58399.2023.00069

De Ferro, Carlos Mão; Cogumbreiro, Tiago; Martins, Francisco (June 2023, International Conference on Dependable Systems and Networks workshops)

Model checking has often been used for verifying Cyber-Physical Systems (CPS). A major challenge is how to capture a model that represents the actual behavior of the software. Model extraction can introduce errors that can affect the accuracy of the analysis including loss of precision, inconsistency, non-conformance, and over- and under-approximations.In this paper, we formalize and prove the correctness of extracting a model from a subset of the MicroPython programming language with respect to a trace-based semantics. The extracted models capture the order of method calls and can be model checked using Shelley. We formalize the extraction process from an intermediate representation of MicroPython codes and prove that the behavior of our intermediate representation is a regular language. Our formalization and theoretical results are fully mechanized using the Coq proof assistant.
more » « less
Full Text Available
Memory access protocols: certified data-race freedom for GPU kernels

https://doi.org/10.1007/s10703-023-00415-0

Cogumbreiro, Tiago; Lange, Julien; Liew, Dennis; Zicarelli, Hannah (May 2023, Formal Methods in System Design)

GPUs offer parallelism as a commodity, but they are difficult to program correctly. Static analyzers that guarantee data-race freedom (DRF) are essential to help programmers establish the correctness of their programs (kernels). However, existing approaches produce too many false alarms and struggle to handle larger programs. To address these limitations we formalize a novel compositional analysis for DRF, based on memory access protocols. These protocols are behavioral types that codify the way threads interact over shared memory. Our work includes fully mechanized proofs of our theoretical results, the first mechanized proofs in the field of DRF analysis for GPU kernels. Our theory is implemented in Faial, a tool that outperforms the state-of-the-art. Notably, it can correctly verify at least 1.42× more real-world kernels, and it exhibits a linear growth in 4 out of 5 experiments, while others grow exponentially in all 5 experiments.
more » « less
Full Text Available
Leibniz International Proceedings in Informatics (LIPIcs):37th European Conference on Object-Oriented Programming (ECOOP 2023)

https://doi.org/10.4230/LIPIcs.ECOOP.2023.13

Jin, Feiyang; Yu, Lechen; Cogumbreiro, Tiago; Shirako, Jun; Sarkar, Vivek (January 2023, European Conference on Object-Oriented Programming)
Ali, Karim; Salvaneschi, Guido (Ed.)
Much of the past work on dynamic data-race and determinacy-race detection algorithms for task parallelism has focused on structured parallelism with fork-join constructs and, more recently, with future constructs. This paper addresses the problem of dynamic detection of data-races and determinacy-races in task-parallel programs with promises, which are more general than fork-join constructs and futures. The motivation for our work is twofold. First, promises have now become a mainstream synchronization construct, with their inclusion in multiple languages, including C++, JavaScript, and Java. Second, past work on dynamic data-race and determinacy-race detection for task-parallel programs does not apply to programs with promises, thereby identifying a vital need for this work. This paper makes multiple contributions. First, we introduce a featherweight programming language that captures the semantics of task-parallel programs with promises and provides a basis for formally defining determinacy using our semantics. This definition subsumes functional determinacy (same output for same input) and structural determinacy (same computation graph for same input). The main theoretical result shows that the absence of data races is sufficient to guarantee determinacy with both properties. We are unaware of any prior work that established this result for task-parallel programs with promises. Next, we introduce a new Dynamic Race Detector for Promises that we call DRDP. DRDP is the first known race detection algorithm that executes a task-parallel program sequentially without requiring the serial-projection property; this is a critical requirement since programs with promises do not satisfy the serial-projection property in general. Finally, the paper includes experimental results obtained from an implementation of DRDP. The results show that, with some important optimizations introduced in our work, the space and time overheads of DRDP are comparable to those of more restrictive race detection algorithms from past work. To the best of our knowledge, DRDP is the first determinacy race detector for task-parallel programs with promises.
more » « less

Search for: All records